
Worldwide Web Crawls

DESCRIPTION

Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites.

Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites.

Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for each Crawl.
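As a concrete illustration, a Seed List is typically just a plain-text file with one URL per line. The URLs below are hypothetical placeholders, not seeds from an actual crawl:

    https://example.com/
    https://example.org/news/
    https://example.net/blog/index.html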

Worldwide Web Crawls are run using Heritrix, the Internet Archive's open-source web crawler.

In addition, various rules are applied to the logic of each crawl. These rules define things like the depth the crawler will try to reach for each host (website) it finds. In general terms, the crawling software identifies all the URLs on each page it captures, follows those links, attempts to capture the linked pages, identifies new URLs on them, follows those links in turn, and so on, until the crawl is stopped or pre-set conditions such as site depth limits are reached. For the most part, a given host will only be captured once per Worldwide Web Crawl, though it may be captured more frequently (e.g. once per hour for various news sites) via other crawls.
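The sketch below illustrates that general link-following logic in Python. It is a minimal, hypothetical model, not Heritrix's actual implementation; the seed URL, depth limit, and crude link-extraction regex are all illustrative assumptions.

    # A minimal, hypothetical sketch of the crawl logic described above.
    # It is NOT Heritrix itself: it follows links breadth-first from a seed
    # list until a pre-set depth limit is reached, capturing each URL once.
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    SEEDS = ["https://example.com/"]    # hypothetical Seed List
    MAX_DEPTH = 2                       # illustrative depth limit (assumption)
    LINK_RE = re.compile(rb'href="(http[^"]+)"')   # crude link extraction

    def crawl(seeds, max_depth):
        visited = set()                               # capture each URL once
        frontier = deque((url, 0) for url in seeds)   # (URL, depth) queue
        while frontier:
            url, depth = frontier.popleft()
            if url in visited or depth > max_depth:
                continue
            visited.add(url)
            try:
                page = urlopen(url, timeout=10).read()   # "capture" the page
            except OSError:
                continue
            # Identify the URLs on the captured page and follow those links.
            for match in LINK_RE.finditer(page):
                link = match.group(1).decode("ascii", "ignore")
                frontier.append((urljoin(url, link), depth + 1))
        return visited

    if __name__ == "__main__":
        print(f"captured {len(crawl(SEEDS, MAX_DEPTH))} URLs")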


ACTIVITY

Created on October 5, 2010 by ARossi (Archivist)

ADDITIONAL CONTRIBUTORS
kngenie (Archivist)

Total Views 11,990,313,622

DISCONTINUED VIEWS

Total Views 11,956,069,205

ITEMS

Total Items 626,560
